word sense
A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin
Jin, Xiaoyun, Ernestus, Mirjam, Baayen, R. Harald
We present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of words' meanings. We used the generalized additive model to decompose a given observed pitch contour into a set of component pitch contours that are tied to different control variables and semantic predictors. Even when variables such as word duration, gender, speaker identity, tonal context, vowel height, and utterance position are controlled for, the effect of word remains a strong predictor of tonal realization. We present evidence that this effect of word is a semantic effect: word sense is shown to be a better predictor than word, and heterographic homophones are shown to have different pitch contours. The strongest evidence for the importance of semantics is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with an accuracy that substantially exceeds a permutation baseline. For phonetics, distributional semantics is a new kid on the block. Although our findings challenge standard theories of Mandarin tone, they fit well within the theoretical framework of the Discriminative Lexicon Model.
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Health & Medicine (0.67)
- Education (0.67)
- Leisure & Entertainment (0.46)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Europe > Spain (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
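The abstract above compares prediction accuracy against a permutation baseline. The following is a minimal sketch of that general idea, not the authors' procedure: it assumes each token's predicted and observed pitch contour has already been reduced to a single number (a hypothetical simplification), and it uses Pearson correlation as a stand-in accuracy measure. The function names and the toy values are illustrative assumptions.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; treat as 0
    return cov / (var_x * var_y) ** 0.5


def permutation_baseline(predicted, observed, n_perm=1000, seed=0):
    """Compare the actual score with scores obtained after shuffling `observed`.

    Shuffling breaks the pairing between predictions and observations, so the
    shuffled scores show what accuracy looks like when there is no real link.
    Returns (actual score, mean shuffled score, one-sided p-value).
    """
    rng = random.Random(seed)
    actual = pearson(predicted, observed)
    shuffled = list(observed)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_scores.append(pearson(predicted, shuffled))
    # Add-one correction keeps the p-value strictly above zero.
    exceed = sum(score >= actual for score in null_scores)
    p_value = (1 + exceed) / (n_perm + 1)
    return actual, mean(null_scores), p_value


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from the paper.
    predicted = [1.0, 2.1, 2.9, 4.2, 5.0, 5.8, 7.1, 8.0]
    observed = [1, 2, 3, 4, 5, 6, 7, 8]
    print(permutation_baseline(predicted, observed))
```

A real analysis would score whole contours rather than single numbers, but the logic of the baseline stays the same: the model is credited only for accuracy beyond what shuffled pairings achieve.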
From Ghazals to Sonnets: Decoding the Polysemous Expressions of Love Across Languages
This paper delves into the intricate world of Urdu poetry, exploring its thematic depths through a lens of polysemy. By focusing on the nuanced differences between three seemingly synonymous words (pyaar, muhabbat, and ishq) we expose a spectrum of emotions and experiences unique to the Urdu language. This study employs a polysemic case study approach, meticulously examining how these words are interwoven within the rich tapestry of Urdu poetry. By analyzing their usage and context, we uncover a hidden layer of meaning, revealing subtle distinctions which lack direct equivalents in English literature. Furthermore, we embark on a comparative analysis, generating word embeddings for both Urdu and English terms related to love. This enables us to quantify and visualize the semantic space occupied by these words, providing valuable insights into the cultural and linguistic nuances of expressing love. Through this multifaceted approach, our study sheds light on the captivating complexities of Urdu poetry, offering a deeper understanding and appreciation for its unique portrayal of love and its myriad expressions.
- North America > United States (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Pakistan (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Europe > Spain (0.04)
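One common way to quantify how close words sit in an embedding space, as the abstract above describes for terms related to love, is cosine similarity. The sketch below is an assumption about how such a comparison could be set up; the abstract does not say which similarity measure or embedding model the authors used. The three-dimensional vectors are arbitrary placeholders, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Map each unordered word pair to the cosine similarity of its vectors."""
    words = sorted(embeddings)
    return {
        (a, b): cosine_similarity(embeddings[a], embeddings[b])
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }


if __name__ == "__main__":
    # Placeholder vectors chosen arbitrarily for illustration; real values
    # would come from a trained embedding model.
    toy_embeddings = {
        "pyaar": [0.9, 0.1, 0.3],
        "muhabbat": [0.8, 0.3, 0.4],
        "ishq": [0.4, 0.9, 0.2],
    }
    for pair, sim in pairwise_similarities(toy_embeddings).items():
        print(pair, round(sim, 3))
```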
Towards Universal Semantics With Large Language Models
Baartmans, Raymond, Raffel, Matthew, Vikram, Rahul, Deringer, Aiden, Chen, Lizhong
The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond. Our code is available at https://github.com/OSU-STARLAB/DeepNSM.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > Pennsylvania (0.04)
Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples
Škvorc, Tadej, Robnik-Šikonja, Marko
Many less-resourced languages struggle with a lack of large, task-specific datasets that are required for solving relevant tasks with modern transformer-based large language models (LLMs). On the other hand, many linguistic resources, such as dictionaries, are rarely used in this context despite their large information contents. We show how LLMs can be used to extend existing language resources in less-resourced languages for two important tasks: word-sense disambiguation (WSD) and word-sense induction (WSI). We approach the two tasks through the related but much more accessible word-in-context (WiC) task where, given a pair of sentences and a target word, a classification model is tasked with predicting whether the sense of a given word differs between sentences. We demonstrate that a well-trained model for this task can distinguish between different word senses and can be adapted to solve the WSD and WSI tasks. The advantage of using the WiC task, instead of directly predicting senses, is that the WiC task does not need pre-constructed sense inventories with a sufficient number of examples for each sense, which are rarely available in less-resourced languages. We show that sentence pairs for the WiC task can be successfully generated from dictionary examples using LLMs. The resulting prediction models outperform existing models on WiC, WSD, and WSI tasks. We demonstrate our methodology on the Slovene language, where a monolingual dictionary is available, but word-sense resources are tiny.
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.05)
- Europe > Slovenia > Savinja > Municipality of Celje > Celje (0.04)
- Asia (0.04)
- Overview (0.93)
- Research Report > New Finding (0.68)
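The word-in-context (WiC) setup described in the abstract above needs labelled sentence pairs. The authors generate theirs from dictionary examples using LLMs; the sketch below is a much simpler stand-in that only pairs up existing example sentences, shown to make the data format concrete. The entry layout, the sense identifiers, and the function name are hypothetical.

```python
from itertools import combinations


def build_wic_pairs(entry):
    """Build WiC pairs from one dictionary entry.

    `entry` maps a sense identifier to the example sentences listed under that
    sense. Each result is (sentence_a, sentence_b, sense_differs), matching the
    task of predicting whether the target word's sense differs between the two
    sentences.
    """
    pairs = []
    senses = sorted(entry)
    # Two examples listed under the same sense: the sense does not differ.
    for sense in senses:
        for a, b in combinations(entry[sense], 2):
            pairs.append((a, b, False))
    # One example from each of two different senses: the sense differs.
    for s1, s2 in combinations(senses, 2):
        for a in entry[s1]:
            for b in entry[s2]:
                pairs.append((a, b, True))
    return pairs


if __name__ == "__main__":
    # Made-up English entry for the target word "bank", for illustration only.
    bank = {
        "bank.1": [
            "She deposited the check at the bank.",
            "The bank approved the loan.",
        ],
        "bank.2": ["They had a picnic on the river bank."],
    }
    for pair in build_wic_pairs(bank):
        print(pair)
```

This pairing needs no sense inventory beyond the dictionary's own division into senses, which is the property the abstract highlights for less-resourced languages.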
Reviews: Visualizing and Measuring the Geometry of BERT
Originality: This submission uses existing techniques to analyze how syntax and semantics are represented in BERT. The authors do a good job of contextualizing the work in terms of previous work, for instance similar analyses for other models (like Word2Vec). They also build off of the work of Hewitt and Manning and provide new theoretical justification for Hewitt and Manning's empirical findings. Quality: Their mathematical arguments are sound, but the authors could add more rigor to the conclusions they draw in the remarks following Theorem 1. The empirical studies show some interesting results.
Word Sense Linking: Disambiguating Outside the Sandbox
Bejgu, Andrei Stefan, Barba, Edoardo, Procopio, Luigi, Fernández-Castro, Alberte, Navigli, Roberto
Word Sense Disambiguation (WSD) is the task of associating a word in a given context with its most suitable meaning among a set of possible candidates. While the task has recently witnessed renewed interest, with systems achieving performances above the estimated inter-annotator agreement, at the time of writing it still struggles to find downstream applications. We argue that one of the reasons behind this is the difficulty of applying WSD to plain text. Indeed, in the standard formulation, models work under the assumptions that a) all the spans to disambiguate have already been identified, and b) all the possible candidate senses of each span are provided, both of which are requirements that are far from trivial. In this work, we present a new task called Word Sense Linking (WSL) where, given an input text and a reference sense inventory, systems have to both identify which spans to disambiguate and then link them to their most suitable meaning. We put forward a transformer-based architecture for the task and thoroughly evaluate both its performance and those of state-of-the-art WSD systems scaled to WSL, iteratively relaxing the assumptions of WSD. We hope that our work will foster easier integration of lexical semantics into downstream applications.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- North America > Dominican Republic (0.04)
A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin
Jin, Xiaoyun, Ernestus, Mirjam, Baayen, R. Harald
In addition, Chuang et al. (2024) recently reported that the tonal contours of disyllabic Mandarin words with the T2-T4 tone pattern are co-determined by their meanings. Following up on Chuang et al.'s (2024) research, we present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of contextual predictors on the one hand, and the way in which words' meanings co-determine pitch contours on the other hand. We analyze the F0 contours of 3824 tokens of 63 different word types in a corpus of spontaneous conversational Taiwan Mandarin, using the generalized additive (mixed) model to decompose a given observed pitch contour into a set of component pitch contours. These component pitch contours isolate the contributions to the pitch contour of the variables taken into account in the statistical model. We show that the tones immediately to the left and right of a word substantially modify a word's canonical tone. Once the effect of tonal context is controlled for, the canonical rising (T2) and dipping (T3) tones emerge as low flat tones, contrasting with T1 as a high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0), which in standard descriptions is taken to primarily depend for its realization on the preceding tone, emerges as a low tone in its own right, the realization of which is modified by the other predictors in the same way as the standard tones T1, T2, T3, and T4. In line with the results from a previous study on disyllabic words with the T2-T4 tonal contour (Chuang et al., 2024), we also show that word, and even more so, word sense, co-determine words' F0 contours, and that, as a consequence, heterographic homophones (e.g., 的, 得, and 地) have their own tonal signatures.
Analyses of variable importance using random forests further supported the substantial effect of tonal context and an effect of word sense that is almost as important as that of tonal context.
- Asia > Taiwan (0.63)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- North America > United States > Massachusetts (0.04)
Coarse-Grained Sense Inventories Based on Semantic Matching between English Dictionaries
Kikuchi, Masato, Ono, Masatsugu, Soga, Toshioki, Tanabe, Tetsu, Ozono, Tadachika
WordNet is one of the largest handcrafted concept dictionaries visualizing word connections through semantic relationships. It is widely used as a word sense inventory in natural language processing tasks. However, WordNet's fine-grained senses have been criticized for limiting its usability. In this paper, we semantically match sense definitions from Cambridge dictionaries and WordNet and develop new coarse-grained sense inventories. We verify the effectiveness of our inventories by comparing their semantic coherences with that of Coarse Sense Inventory. The advantages of the proposed inventories include their low dependency on large-scale resources, better aggregation of closely related senses, CEFR-level assignments, and ease of expansion and improvement.